58 research outputs found

A realistic assessment of methods for extracting gene/protein interactions from free text

Author: A Moschitti
AB Clegg
Adrian J Shepherd
AM Cohen
Andrew B Clegg
AS Yeh
B Settles
C Nédellec
D Rebholz-Schuhmann
H Jose
HL Johnson
J Ding
J Fluck
JD Kim
JD Kim
K Franzén
K Fundel
K Sagae
L Hunter
M Krallinger
N Domedel-Puig
R Bunescu
R Hoffmann
R Kabiljo
R Kabiljo
R Leaman
R Sætre
Renata Kabiljo
S Pyysalo
S Pyysalo
S Pyysalo
T Hara
WA Baumgartner
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

Background: The automated extraction of gene and/or protein interactions from the literature is one of the most important targets of biomedical text mining research. In this paper we present a realistic evaluation of gene/protein interaction mining relevant to potential non-specialist users. Hence we have specifically avoided methods that are complex to install or require reimplementation, and we coupled our chosen extraction methods with a state-of-the-art biomedical named entity tagger. Results: Our results show: that performance across different evaluation corpora is extremely variable; that the use of tagged (as opposed to gold standard) gene and protein names has a significant impact on performance, with a drop in F-score of over 20 percentage points being commonplace; and that a simple keyword-based benchmark algorithm when coupled with a named entity tagger outperforms two of the tools most widely used to extract gene/protein interactions. Conclusion: In terms of availability, ease of use and performance, the potential non-specialist user community interested in automatically extracting gene and/or protein interactions from free text is poorly served by current tools and systems. The public release of extraction tools that are easy to install and use, and that achieve state-of-the-art levels of performance, should be treated as a high priority by the biomedical text mining community.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

UCL Discovery

PubMed Central

Birkbeck Institutional Research Online

Improving the efficiency of ILP systems

Author: A. Srinivasan
A. Srinivasan
C. Nédellec
H. Blockeel
M. Botta
P. Laag van der
S. Muggleton
V. Santos Costa
Publication date: 01/01/2003

Inductive Logic Programming (ILP) is a promising technology for knowledge extraction applications. ILP has produced intelligible solutions for a wide variety of domains where it has been applied. ILP's lack of efficiency is, however, a major impediment to its scalability to applications requiring large amounts of data. In this paper we propose a set of techniques that improve the efficiency of ILP systems and make them more likely to scale up to applications of knowledge extraction from large datasets. We propose and evaluate the lazy evaluation of examples to improve the efficiency of ILP systems. Lazy evaluation is essentially a way to avoid or postpone the evaluation of the generated hypotheses (coverage tests). The techniques were evaluated using the IndLog system on ILP datasets referenced in the literature. The proposals lead to substantial efficiency improvements and are generally applicable to any ILP system.

Crossref

Repositório Aberto da Universidade do Porto

Comparative analysis of five protein-protein interaction corpora

Author: A Rzhetsky
Antti Airola
C Blaschke
C Nédellec
D Klein
DJ Best
Filip Ginter
HL Johnson
J Ding
J Kim
Jari Björne
JD Wren
Juho Heimonen
K Fundel
KB Cohen
L Smith
M Light
N Daraselia
R Bunescu
R Ihaka
S Pyysalo
Sampo Pyysalo
Tapio Salakoski
WJ Wilbur
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Growing interest in the application of natural language processing methods to biomedical text has led to an increasing number of corpora and methods targeting protein-protein interaction (PPI) extraction. However, there is no general consensus regarding PPI annotation and consequently resources are largely incompatible and methods are difficult to evaluate. Results: We present the first comparative evaluation of the diverse PPI corpora, performing quantitative evaluation using two separate information extraction methods as well as detailed statistical and qualitative analyses of their properties. For the evaluation, we unify the corpus PPI annotations to a shared level of information, consisting of undirected, untyped binary interactions of non-static types with no identification of the words specifying the interaction, no negations, and no interaction certainty. We find that the F-score performance of a state-of-the-art PPI extraction method varies on average 19 percentage units and in some cases over 30 percentage units between the different evaluated corpora. The differences stemming from the choice of corpus can thus be substantially larger than differences between the performance of PPI extraction methods, which suggests definite limits on the ability to compare methods evaluated on different resources. We analyse a number of potential sources for these differences and identify factors explaining approximately half of the variance. We further suggest ways in which the difficulty of the PPI extraction tasks codified by different corpora can be determined to advance comparability. Our analysis also identifies points of agreement and disagreement in PPI corpus annotation that are rarely explicitly stated by the authors of the corpora. Conclusions: Our comparative analysis uncovers key similarities and differences between the diverse PPI corpora, thus taking an important step towards standardization. In the course of this study we have created a major practical contribution in converting the corpora into a shared format. The conversion software is freely available at http://mars.cs.utu.fi/PPICorpora.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Event extraction of bacteria biotopes: a knowledge-intensive NLP-based approach

Author: A Airola
A Culotta
AP Manine
AR Aronson
BJ Grosz
C Jacquemin
C Nédellec
D Bollegala
D Field
D Zelenko
G Erkan
I Segura-Bedmar
JD Kim
JO Korbel
K Fundel
K Liolios
M Torii
N Kambhatla
Pierre Warnier
R Bossy
R Bossy
S Aubin
S Lappin
SA Kripke
SP Lapage
T Hamon
T Ono
Wiktoria Golik
Y Lin
Z GuoDong
Zorana Ratkovic
Publication venue: BioMed Central
Publication date: 01/01/2012

Background: Bacteria biotopes cover a wide range of diverse habitats including animal and plant hosts, natural, medical and industrial environments. The high volume of publications in the microbiology domain provides a rich source of up-to-date information on bacteria biotopes. This information, as found in scientific articles, is expressed in natural language and is rarely available in a structured format, such as a database. This information is of great importance for fundamental research and microbiology applications (e.g., medicine, agronomy, food, bioenergy). The automatic extraction of this information from texts will provide a great benefit to the field.

Crossref

Springer - Publisher Connector

PubMed Central

HAL Descartes

Hal-Diderot

Constructing a semantic predication gold standard from the biomedical literature

Author: A Jimeno
A Névéol
A Roberts
AR Aronson
AT McCray
B Rosario
C Bizer
C Friedman
C Nédellec
CB Ahlers
D Hristovski
D Maglott
D Rebholz-Schuhmann
G Hripcsak
Graciela Rosemblat
H Kilicoglu
Halil Kilicoglu
J Björne
J Cohen
JD Kim
JD Kim
JD Kim
JD Kim
JP Pestian
L Tanabe
LH Smith
M Bada
M Fiszman
Marcelo Fiszman
O Bodenreider
P Thompson
R Bunescu
S Pyysalo
T Cohen
T Wattarujeekrit
TC Rindflesch
TC Rindflesch
Thomas C Rindflesch
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Semantic relations increasingly underpin biomedical text mining and knowledge discovery applications. The success of such practical applications crucially depends on the quality of extracted relations, which can be assessed against a gold standard reference. Most such references in biomedical text mining focus on narrow subdomains and adopt different semantic representations, rendering them difficult to use for benchmarking independently developed relation extraction systems. In this article, we present a multi-phase gold standard annotation study, in which we annotated 500 sentences randomly selected from MEDLINE abstracts on a wide range of biomedical topics with 1371 semantic predications. The UMLS Metathesaurus served as the main source for conceptual information and the UMLS Semantic Network for relational information. We measured interannotator agreement and analyzed the annotations closely to identify some of the challenges in annotating biomedical text with relations based on an ontology or a terminology. Results: We obtain fair to moderate interannotator agreement in the practice phase (0.378-0.475). With improved guidelines and additional semantic equivalence criteria, the agreement increases by 12% (0.415 to 0.536) in the main annotation phase. In addition, we find that agreement increases to 0.688 when the agreement calculation is limited to those predications that are based only on the explicitly provided UMLS concepts and relations. Conclusions: While interannotator agreement in the practice phase confirms that conceptual annotation is a challenging task, the increasing agreement in the main annotation phase points out that an acceptable level of agreement can be achieved in multiple iterations, by setting stricter guidelines and establishing semantic equivalence criteria. Mapping text to ontological concepts emerges as the main challenge in conceptual annotation. Annotating predications involving biomolecular entities and processes is particularly challenging. While the resulting gold standard is mainly intended to serve as a test collection for our semantic interpreter, we believe that the lessons learned are applicable generally.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Mining clinical relationships from patient narratives

Author: A Rector
A Roberts
A Roberts
A Roberts
Angus Roberts
C Blaschke
C Friedman
C Giuliano
C Grover
C Nédellec
CB Ahlers
D Klein
D Lindberg
D Zelenko
Defense Advanced Research Projects Agency
G Doddington
G Zhou
H Cunningham
H Harkema
J Pustejovsky
J Thomas
K Fundel
M Goadrich
Mark Hepple
N Chinchor
N Sager
P Zweigenbaum
R Bunescu
R Gaizauskas
RC Bunescu
Robert Gaizauskas
S Katrenko
S Miller
S Pakhomov
T Rindflesch
T Wang
TC Rindflesch
U Hahn
W Chapman
Y Li
Y Lussier
Yikun Guo
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: The Clinical E-Science Framework (CLEF) project has built a system to extract clinically significant information from the textual component of medical records in order to support clinical research, evidence-based healthcare and genotype-meets-phenotype informatics. One part of this system is the identification of relationships between clinically important entities in the text. Typical approaches to relationship extraction in this domain have used full parses, domain-specific grammars, and large knowledge bases encoding domain knowledge. In other areas of biomedical NLP, statistical machine learning (ML) approaches are now routinely applied to relationship extraction. We report on the novel application of these statistical techniques to the extraction of clinical relationships. Results: We have designed and implemented an ML-based system for relation extraction, using support vector machines, and trained and tested it on a corpus of oncology narratives hand-annotated with clinically important relationships. Over a class of seven relation types, the system achieves an average F1 score of 72%, only slightly behind an indicative measure of human inter-annotator agreement on the same task. We investigate the effectiveness of different features for this task, how extraction performance varies between inter- and intra-sentential relationships, and examine the amount of training data needed to learn various relationships. Conclusion: We have shown that it is possible to extract important clinical relationships from text, using supervised statistical ML techniques, at levels of accuracy approaching those of human annotators. Given the importance of relation extraction as an enabling technology for text mining and given also the ready adaptability of systems based on our supervised learning approach to other clinical relationship extraction tasks, this result has significance for clinical text mining more generally, though further work to confirm our encouraging results should be carried out on a larger sample of narratives and relationship types.

Crossref

Springer - Publisher Connector

PubMed Central

White Rose Research Online

All-paths graph kernel for protein-protein interaction extraction with evaluation of cross-corpus learning

Author: A Airola
A Yakushiji
AB Clegg
Antti Airola
AP Bradley
C Giuliano
C Nédellec
CD Meyer
D Zelenko
Filip Ginter
J Björne
J Ding
J Heimonen
JA Hanley
JAK Suykens
Jari Björne
JD Kim
JG Caporaso
K Fundel
KB Cohen
L Hirschman
L Hunter
M Lease
M Miwa
MC de Marneffe
P Zweigenbaum
R Bunescu
R Bunescu
R Bunescu
R Rifkin
R Sætre
S Pyysalo
S Pyysalo
S Pyysalo
S Van Landeghem
Sampo Pyysalo
T Gärtner
T Mitsumori
T Pahikkala
T Pahikkala
Tapio Pahikkala
Tapio Salakoski
Y Miyao
Publication venue: BioMed Central
Publication date: 01/01/2008

Crossref

Springer - Publisher Connector

PubMed Central

Investigating heterogeneous protein annotations toward cross-corpora utilization

Author: A Arnold
A Yeh
AM Cohen
B Alex
B Efron
C Nédellec
CJ Kuo
EFTK Sang
EW Noreen
F Rinaldi
F Sha
G Zhou
H Daumé III
H Shatkay
HL Johnson
J Wilbur
JD Kim
JD Kim
Jin-Dong Kim
Jun'ichi Tsujii
K Franzén
K Yoshida
KB Cohen
L Gillick
L Tanabe
MA Mandel
R Bunescu
R Bunescu
R Kabiljo
RTH Tsai
Rune Sætre
S Pyysalo
Sampo Pyysalo
T Ohta
V Hatzivassiloglou
X Sun
Y Song
Y Wang
Yue Wang
Publication venue: BioMed Central
Publication date: 01/12/2009

Background: The number of corpora, collections of structured texts, has been increasing, as a result of the growing interest in the application of natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However, in the biomedical community, there is yet no general consensus regarding named entity annotation; thus, the resources are largely incompatible, and it is difficult to compare the performance of systems developed on resources that were divergently annotated. On the other hand, from a practical application perspective, it is desirable to utilize as many existing annotated resources as possible, because annotation is costly. Thus, it becomes a task of interest to integrate the heterogeneous annotations in these resources. Results: We explore the potential sources of incompatibility among gene and protein annotations that were made for three common corpora: GENIA, GENETAG and AIMed. To show the inconsistency in the corpora annotations, we first tackle the incompatibility problem caused by corpus integration, and we quantitatively measure the effect of this incompatibility on protein mention recognition. We find that the F-score performance declines tremendously when training with integrated data, instead of training with pure data; in some cases, the performance drops nearly 12%. This degradation may be caused by the newly added heterogeneous annotations, and cannot be fixed without an understanding of the heterogeneities that exist among the corpora. Motivated by the result of this preliminary experiment, we further qualitatively analyze a number of possible sources for these differences, and investigate the factors that would explain the inconsistencies, by performing a series of well-designed experiments. Our analyses indicate that incompatibilities in the gene/protein annotations exist mainly in the following four areas: the boundary annotation conventions, the scope of the entities of interest, the distribution of annotated entities, and the ratio of overlap between annotated entities. We further suggest that almost all of the incompatibilities can be prevented by properly considering the four aspects aforementioned. Conclusion: Our analysis covers the key similarities and dissimilarities that exist among the diverse gene/protein corpora. This paper serves to improve our understanding of the differences in the three studied corpora, which can then lead to a better understanding of the performance of protein recognizers that are based on the corpora.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

CD5 expression promotes IL-10 production through activation of the MAPK/Erk pathway and upregulation of TRPC1 channels in B lymphocytes.

Author: A Bauch
A Kitabayashi
A Limnander
A Limnander
AF Muggen
AM Buhl
AS Roedding
B Apollonio
C Raman
DA Frank
E Campo
E Tibaldi
E Yildirim
GP Sims
H Gary-Gouy
H Gary-Gouy
H Gary-Gouy
H Gary-Gouy
H Gary-Gouy
HJ Gross
J Chen
JD Richards
JI Healy
JJ O’Shea
JJ Perez-Villar
K Hayakawa
K Kawauchi
K Parikh
KL Hippen
KW Moore
L Fayad
M Muzio
MJ Chumley
MR Sarrias
N Burdin
NS Roa
P Li
R Berland
RR Hardy
S Cheng
S Garaud
S Garaud
S Garaud
S Loisel
S Nédellec
SC Wong
T Defrance
V Devauchelle-Pensec
Y Mori
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 02/08/2016

CD5 is constitutively expressed on T cells and a subset of mature normal and leukemic B cells in patients with chronic lymphocytic leukemia (CLL). Important functional properties are associated with CD5 expression in B cells, including signal transducer and activator of transcription 3 activation, IL-10 production and the promotion of B-lymphocyte survival and transformation. However, the pathway(s) by which CD5 influences the biology of B cells and its dependence on B-cell receptor (BCR) co-signaling remain unknown. In this study, we show that CD5 expression activates a number of important signaling pathways, including Erk1/2, leading to IL-10 production through a novel pathway independent of BCR engagement. This pathway is dependent on extracellular calcium (Ca2+) entry facilitated by upregulation of the transient receptor potential channel 1 (TRPC1) protein. We also show that Erk1/2 activation in a subgroup of CLL patients is associated with TRPC1 overexpression. In this subgroup of CLL patients, small inhibitory RNA (siRNA) for CD5 reduces TRPC1 expression. Furthermore, siRNAs for CD5 or for TRPC1 inhibit IL-10 production. These findings provide new insights into the role of CD5 in B-cell biology in health and disease and could pave the way for new treatment strategies for patients with B-CLL

Crossref

HAL-Université de Bretagne Occidentale

UCL Discovery

EUR Research Repository

Okina

Queen Mary Research Online

HAL UVSQ